Abstract

Kernel methods are an extremely popular set of techniques used for many important machine learning and 
data analysis applications. In addition to having good practical performance, these methods are supported by 
a well-developed theory. Kernel methods use an implicit mapping of the input data into a high dimensional 
feature space defined by a kernel function, i.e., a function returning the inner product between the images 
of two data points in the feature space. Central to any kernel method is the kernel matrix, which is built by 
evaluating the kernel function on a given sample dataset. 

In this paper, we initiate the study of the non-asymptotic spectral theory of random kernel matrices. These
are $n \times n$ random matrices whose $(i, j)$th entry is obtained by evaluating the kernel function on $\mathbf{x}_i$ and $\mathbf{x}_j$,
where $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are a set of $n$ independent random high-dimensional vectors. Our main contribution is to
obtain tight upper bounds on the spectral norm (largest eigenvalue) of random kernel matrices constructed by
commonly used kernel functions based on polynomials and Gaussian radial basis.

As an application of these results, we provide lower bounds on the distortion needed for releasing the coefficients
of kernel ridge regression under attribute privacy, a general privacy notion which captures a large class
of privacy definitions. Kernel ridge regression is a standard method for performing non-parametric regression
that regularly outperforms traditional regression approaches in various domains. Our privacy distortion lower 
bounds are the first for any kernel technique, and our analysis assumes realistic scenarios for the input, unlike 
all previous lower bounds for other release problems which only hold under very restrictive input settings. 


1 Introduction 

In recent years there has been significant progress in the development and application of kernel methods for many 
practical machine learning and data analysis problems. Kernel methods are regularly used for a range of problems 
such as classification (binary/multiclass), regression, ranking, and unsupervised learning, where they are known 
to almost always outperform “traditional” statistical techniques ll2^ |24l . At the heart of kernel methods is the 
notion of a kernel function, which is a real-valued function of two variables. The power of kernel methods stems
from the fact that for every (positive definite) kernel function it is possible to define an inner product and a lifting
(which could be nonlinear) such that the inner product between any two lifted datapoints can be quickly computed
using the kernel function evaluated at those two datapoints. This allows for the introduction of nonlinearity into the
traditional optimization problems (such as Ridge Regression, Support Vector Machines, Principal Component 
Analysis) without unduly complicating them. 

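The main ingredient of any kernel method is the kernel matrix, which is built using the kernel function,
evaluated at given sample points. Formally, given a kernel function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and a sample set $\mathbf{x}_1, \ldots, \mathbf{x}_n$,
the kernel matrix $K$ is an $n \times n$ matrix with its $(i, j)$th entry $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$. Common choices of kernel
functions include the polynomial kernel ($\kappa(\mathbf{x}_i, \mathbf{x}_j) = (a\langle \mathbf{x}_i, \mathbf{x}_j \rangle + b)^p$, for $p \in \mathbb{N}$) and the Gaussian kernel
($\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp(-a\|\mathbf{x}_i - \mathbf{x}_j\|^2)$, for $a > 0$) [?].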

In this paper, we initiate the study of non-asymptotic spectral properties of random kernel matrices. A
random kernel matrix, for a kernel function $\kappa$, is the kernel matrix $K$ formed by $n$ independent random vectors
$\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$. The prior work on random kernel matrices [?] has established various interesting
properties of the spectral distributions of these matrices in the asymptotic sense (as $n, d \to \infty$). However,
analyzing algorithms based on kernel methods typically requires understanding of the spectral properties of 
these random kernel matrices for large, but fixed n,d. A similar parallel also holds in the study of the spectral 
properties of “traditional” random matrices, where recent developments in the non-asymptotic theory of random 
matrices have complemented the classical random matrix theory that was mostly focused on asymptotic spectral 
properties Il27ll20ll . 

We investigate upper bounds on the largest eigenvalue (spectral norm) of random kernel matrices for polynomial
and Gaussian kernels. We show that for inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ drawn independently from a wide class of
probability distributions over $\mathbb{R}^d$ (satisfying the subgaussian property), the spectral norm of a random kernel matrix
constructed using a polynomial kernel of degree $p$ is, with high probability, roughly bounded by $O(d^p n)$. In a
similar setting, we show that the spectral norm of a random kernel matrix constructed using a Gaussian kernel is
bounded by $O(n)$, and with high probability, this bound reduces to $O(1)$ under some stronger assumptions on the
subgaussian distributions. These bounds are almost tight. Since the entries of a random kernel matrix are highly
correlated, the existing techniques prevalent in random matrix theory cannot be directly applied. We overcome
this problem by careful splitting and conditioning arguments on the random kernel matrix. Combining these with
subgaussian norm concentrations forms the basis of our proofs.

Applications. The largest eigenvalue of kernel matrices plays an important role in the analysis of many machine
learning algorithms. Some examples include bounding the Rademacher complexity for multiple kernel learning [?],
analyzing the convergence rate of the conjugate gradient technique for matrix-valued kernel learning [26],
and establishing concentration bounds for eigenvalues of kernel matrices [?, 25].

In this paper, we focus on an application of these eigenvalue bounds to an important problem arising while 
analyzing sensitive data. Consider a curator who manages a database of sensitive information but wants to release 
statistics about how a sensitive attribute (say, disease) in the database relates with some nonsensitive attributes 
(e.g., postal code, age, gender, etc). This setting is widely considered in the applied data privacy literature, 
partly since it arises with medical and retail data. Ridge regression is a well-known approach for solving these 
problems due to its good generalization performance. Kernel ridge regression is a powerful technique for building
nonlinear regression models that operates by combining ridge regression with kernel methods [?].$^1$ We present
a linear reconstruction attack$^2$ that reconstructs, with high probability, almost all the sensitive attribute entries
given a sufficiently accurate approximation of the kernel ridge regression coefficients. We consider reconstruction
attacks against attribute privacy, a loose notion of privacy, where the goal is just to avoid any gross violation of
privacy. Concretely, the input is assumed to be a database whose $i$th row (record for individual $i$) is $(\mathbf{x}_i, y_i)$, where
$\mathbf{x}_i \in \mathbb{R}^d$ is assumed to be known to the attacker (public information) and $y_i \in \{0, 1\}$ is the sensitive attribute,
and a privacy mechanism is attribute non-private if the attacker can consistently reconstruct a large fraction of
the sensitive attribute $(y_1, \ldots, y_n)$. We show that any privacy mechanism that always adds $\approx o(1/(d^p n))$ noise$^3$
to each coefficient of a polynomial kernel ridge regression model is attribute non-private. Similarly, any privacy
mechanism that always adds $\approx o(1)$ noise$^3$ to each coefficient of a Gaussian kernel ridge regression model is
attribute non-private. As we later discuss, there exist natural settings of inputs under which these kernel ridge
regression coefficients, even without the privacy constraint, have the same magnitude as these noise bounds,
implying that privacy comes at a steep price. While the linear reconstruction attacks employed in this paper
themselves are well-known [9, 15, ?], these are the first attribute privacy lower bounds that: (i) are applicable
to any kernel method and (ii) work for any $d$-dimensional data; analyses of all previous attacks (for other release

* We provide a brief coverage of the basics of kernel ridge regression in Sectionj^ 

$^2$ In a linear reconstruction attack, given the released information $\mathbf{p}$, the attacker constructs a system of approximate linear equalities
of the form $A\mathbf{z} \approx \mathbf{p}$ for a matrix $A$ and attempts to solve for $\mathbf{z}$.

$^3$ Ignoring the dependence on other parameters, including the regularization parameter of ridge regression.
problems) require d to be comparable to n. Additionally, unlike previous reconstruction attack analyses, our
bounds hold for a wide class of realistic distributional assumptions on the data. 

1.1 Comparison with Related Work 

In this paper, we study the largest eigenvalue of an $n \times n$ random kernel matrix in the non-asymptotic sense. The
general goal in studying the non-asymptotic theory of random matrices is to understand the spectral properties
of random matrices that are valid with high probability for matrices of a large fixed size. This is in contrast
with the existing theory on random kernel matrices, which has focused on the asymptotics of various spectral
characteristics of these random matrices when the dimensions of the matrices tend to infinity. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$
be $n$ i.i.d. random vectors. For any $F : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$, symmetric in the first two variables, consider
the random kernel matrix $K$ with $(i, j)$th entry $K_{ij} = F(\mathbf{x}_i, \mathbf{x}_j, d)$. El Karoui [?] considered the case where $K$
is generated by either the inner-product kernels (i.e., $F(\mathbf{x}_i, \mathbf{x}_j, d) = f(\langle \mathbf{x}_i, \mathbf{x}_j \rangle, d)$) or the distance kernels (i.e.,
$F(\mathbf{x}_i, \mathbf{x}_j, d) = f(\|\mathbf{x}_i - \mathbf{x}_j\|^2, d)$). It was shown there that under some assumptions on $f$ and on the distributions
of the $\mathbf{x}_i$'s, and in the “large $d$, large $n$” limit (i.e., $d, n \to \infty$ and $d/n \to (0, \infty)$): a) the non-linear kernel
matrix converges asymptotically in spectral norm to a linear kernel matrix, and b) there is a weak convergence of
the limiting spectral density. These results were recently strengthened in different directions by Cheng et al. [?]
and Do et al. [?]. To the best of our knowledge, ours is the first paper investigating the non-asymptotic spectral
properties of a random kernel matrix.

Just as the development of the non-asymptotic theory of traditional random matrices has found a multitude of applications
in areas including statistics, geometric functional analysis, and compressed sensing [?], we believe
that the growth of a non-asymptotic theory of random kernel matrices will help in better understanding many
machine learning applications that utilize kernel techniques.

The goal of private data analysis is to release global, statistical properties of a database while protecting the
privacy of the individuals whose information the database contains. Differential privacy [?] is a formal notion of
privacy tailored to private data analysis. Differential privacy requires, roughly, that any single individual’s data
have little effect on the outcome of the analysis. A lot of recent research has gone into developing differentially
private algorithms for various applications, including kernel methods [?]. A typical objective here is to release
as accurate an approximation as possible to some function $f$ evaluated on a database $D$.

In this paper, we follow a complementary line of work that seeks to understand how much distortion (noise)
is necessary to privately release some particular function $f$ evaluated on a database containing sensitive information [?].
The general idea here is to provide reconstruction attacks, which are attacks
that can reconstruct (almost all of) the sensitive part of database $D$ given sufficiently accurate approximations to
$f(D)$. Reconstruction attacks violate any reasonable notion of privacy (including differential privacy), and the
existence of these attacks directly translates into lower bounds on the distortion needed for privacy.

Linear reconstruction attacks were first considered in the context of data privacy by Dinur and Nissim [5], who
showed that any mechanism which answers $n \log n$ random inner product queries on a database in $\{0, 1\}^n$ with
$o(\sqrt{n})$ noise per query is not private. Their attack was subsequently extended in various directions by [?, 9, 18, 3].
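To make the idea of a linear reconstruction attack concrete, the following is a minimal sketch under stated assumptions. It is not the attack of Dinur and Nissim (which uses linear programming) nor the attack analyzed in this paper: it swaps in ordinary least squares followed by rounding, and the database size, query matrix, noise level, and the helper name `linear_reconstruction_attack` are all hypothetical choices for illustration.

```python
import numpy as np

def linear_reconstruction_attack(n=200, num_queries=None, noise_scale=1.0, seed=0):
    """Toy attack: recover a secret bit vector from noisy random inner products."""
    rng = np.random.default_rng(seed)
    if num_queries is None:
        num_queries = int(n * np.log(n))            # about n log n queries
    secret = rng.integers(0, 2, size=n)             # hypothetical sensitive bits in {0,1}^n
    A = rng.integers(0, 2, size=(num_queries, n)).astype(float)  # random 0/1 queries
    noise = rng.uniform(-noise_scale, noise_scale, size=num_queries)
    released = A @ secret + noise                   # noisy answers seen by the attacker
    z, *_ = np.linalg.lstsq(A, released, rcond=None)  # solve A z ~ released
    guess = (z > 0.5).astype(int)                   # round the estimate back to bits
    return float(np.mean(guess == secret))

fraction_recovered = linear_reconstruction_attack()
print(f"fraction of bits recovered: {fraction_recovered:.3f}")
```

With per-query noise well below $\sqrt{n}$, the rounded least-squares solution should agree with the secret on almost every coordinate, which is the qualitative behavior the lower bounds formalize.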

The results that are closest to our work are the attribute privacy lower bounds analyzed for releasing $k$-way
marginals [15, 4], linear/logistic regression parameters [14], and a subclass of statistical M-estimators [14].
Kasiviswanathan et al. HH showed that, if d = then any mechanism which releases all fc-way 

marginal tables with $o(\sqrt{n})$ noise per entry is attribute non-private.$^4$ These noise bounds were improved by
De [4], who presented an attack that can tolerate a constant fraction of entries with arbitrarily high noise, as long
as the remaining entries have $o(\sqrt{n})$ noise. Kasiviswanathan et al. [14] recently showed that, if $d = \Omega(n)$, then
any mechanism which releases $d$ different linear or logistic regression estimators each with $o(1/\sqrt{n})$ noise is
attribute non-private. They also showed that this lower bound extends to a subclass of statistical M-estimator

$^4$ The $\tilde{\Omega}$ notation hides polylogarithmic factors.
release problems. A point to observe is that in all the above referenced results, $d$ has to be comparable to $n$, and
this dependency looks unavoidable in those results due to their use of least singular value bounds. However, in
this paper, our privacy lower bounds hold for all values of $d, n$ ($d$ could be $\ll n$). Additionally, all the previous
reconstruction attack analyses critically require the $\mathbf{x}_i$’s to be drawn from a product of univariate subgaussian
distributions, whereas our analysis here holds for any $d$-dimensional subgaussian distributions (not necessarily
product distributions), and is therefore more widely applicable. The subgaussian assumption on the input data is quite
common in the analysis of machine learning algorithms [?].

2 Preliminaries 

Notation. We use $[n]$ to denote the set $\{1, \ldots, n\}$. $d_H(\cdot, \cdot)$ measures the Hamming distance. Vectors used
in the paper are by default column vectors and are denoted by boldface letters. For a vector $\mathbf{v}$, $\mathbf{v}^\top$ denotes its
transpose and $\|\mathbf{v}\|$ denotes its Euclidean norm. For two vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, $\langle \mathbf{v}_1, \mathbf{v}_2 \rangle$ denotes the inner product of
$\mathbf{v}_1$ and $\mathbf{v}_2$. For a matrix $M$, $\|M\|$ denotes its spectral norm, $\|M\|_F$ denotes its Frobenius norm, and $M_{ij}$ denotes
its $(i, j)$th entry. $I_n$ represents the identity matrix in dimension $n$. The unit sphere in $d$ dimensions centered at
the origin is denoted by $S^{d-1} = \{\mathbf{z} : \|\mathbf{z}\| = 1, \mathbf{z} \in \mathbb{R}^d\}$. Throughout this paper $C, c, C'$, also with subscripts,
denote absolute constants (i.e., independent of $d$ and $n$), whose value may change from line to line.

2.1 Background on Kernel Methods 

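We provide a very brief introduction to the theory of kernel methods; see the many books on the topic [?, 24]
for further details.

Definition 1 (Kernel Function). Let $\mathcal{X}$ be a non-empty set. Then a function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel
function on $\mathcal{X}$ if there exists a Hilbert space $\mathcal{H}$ over $\mathbb{R}$ and a map $\phi : \mathcal{X} \to \mathcal{H}$ such that for all $\mathbf{x}, \mathbf{y} \in \mathcal{X}$, we
have

$$\kappa(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle_{\mathcal{H}}.$$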

For any symmetric and positive semidefinite$^5$ kernel $\kappa$, by Mercer’s theorem [?] there exists: (i) a unique
functional Hilbert space $\mathcal{H}$ (referred to as the reproducing kernel Hilbert space, Definition 2) on $\mathcal{X}$ such that
$\kappa(\cdot, \cdot)$ is the inner product in the space and (ii) a map $\phi$ defined as $\phi(\mathbf{x}) := \kappa(\cdot, \mathbf{x})$$^6$ that satisfies Definition 1.
The function $\phi$ is called the feature map and the space $\mathcal{H}$ is called the feature space.

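Definition 2 (Reproducing Kernel Hilbert Space). A kernel $\kappa(\cdot, \cdot)$ is a reproducing kernel of a Hilbert space $\mathcal{H}$ if
$\forall f \in \mathcal{H}$, $f(\mathbf{x}) = \langle \kappa(\cdot, \mathbf{x}), f(\cdot) \rangle_{\mathcal{H}}$. For a (compact) $\mathcal{X} \subseteq \mathbb{R}^d$ and a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$, we
say $\mathcal{H}$ is a Reproducing Kernel Hilbert Space if there exists $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that: a) $\kappa$ has the reproducing property,
and b) $\kappa$ spans $\mathcal{H} = \mathrm{span}\{\kappa(\cdot, \mathbf{x}) : \mathbf{x} \in \mathcal{X}\}$.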

A standard idea used in the machine-learning community (commonly referred to as the “kernel trick”) is that
kernels allow for the computation of inner products in high-dimensional feature spaces ($\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle_{\mathcal{H}}$) using
simple functions defined on pairs of input patterns ($\kappa(\mathbf{x}, \mathbf{y})$), without knowing the $\phi$ mapping explicitly. This
trick allows one to efficiently solve a variety of non-linear optimization problems. Note that there is no restriction
on the dimension of the feature maps ($\phi(\mathbf{x})$), i.e., it could be of infinite dimension.

Polynomial and Gaussian are two popular kernel functions that are used in many machine learning and data
mining tasks such as classification, regression, ranking, and structured prediction. Let the input space be $\mathcal{X} = \mathbb{R}^d$.
For $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, these kernels are defined as:

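$^5$ A positive definite kernel is a function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for any $n \ge 1$, for any finite set of points $\{\mathbf{x}_i\}_{i=1}^n$ in $\mathcal{X}$ and real
numbers $\{a_i\}_{i=1}^n$, we have $\sum_{i,j} a_i a_j \kappa(\mathbf{x}_i, \mathbf{x}_j) \ge 0$.

$^6$ $\kappa(\cdot, \mathbf{x})$ is a vector with entries $\kappa(\mathbf{x}', \mathbf{x})$ for all $\mathbf{x}' \in \mathcal{X}$.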
(1) Polynomial Kernel: $\kappa(\mathbf{x}, \mathbf{y}) = (a\langle \mathbf{x}, \mathbf{y} \rangle + b)^p$, with parameters $a, b \in \mathbb{R}$ and $p \in \mathbb{N}$. Here $a$ is referred
to as the slope parameter, $b \ge 0$ trades off the influence of higher-order versus lower-order terms in the
polynomial, and $p$ is the polynomial degree. For an input $\mathbf{x} \in \mathbb{R}^d$, the feature map $\phi(\mathbf{x})$ of the polynomial
kernel is a vector whose number of dimensions is polynomial in $d$ [?].

(2) Gaussian Kernel (also frequently referred to as the radial basis kernel): $\kappa(\mathbf{x}, \mathbf{y}) = \exp(-a\|\mathbf{x} - \mathbf{y}\|^2)$
with real parameter $a > 0$. The value of $a$ controls the locality of the kernel, with low values indicating
that the influence of a single point is “far” and vice versa [?]. An equivalent popular formulation is to set
$a = 1/(2\sigma^2)$, and hence $\kappa(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2)$. For an input $\mathbf{x} \in \mathbb{R}^d$, the feature map $\phi(\mathbf{x})$ of
the Gaussian kernel is a vector of infinite dimensions [?]. Note that while we focus on the Gaussian kernel
in this paper, the extension of our results to other exponential kernels such as the Laplacian kernel (where
$\kappa(\mathbf{x}, \mathbf{y}) = \exp(-a\|\mathbf{x} - \mathbf{y}\|_1)$) is quite straightforward.
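The two kernel definitions above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the sample sizes, and the parameter values below are assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

def polynomial_kernel_matrix(X, a=1.0, b=1.0, p=2):
    """K_ij = (a <x_i, x_j> + b)^p for the rows x_i of X."""
    return (a * (X @ X.T) + b) ** p

def gaussian_kernel_matrix(X, a=1.0):
    """K_ij = exp(-a ||x_i - x_j||^2) for the rows x_i of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)    # clip tiny negatives caused by round-off
    return np.exp(-a * sq_dists)

rng = np.random.default_rng(0)
n, d = 50, 20                               # illustrative sizes only
X = rng.standard_normal((n, d))             # rows are the sample points
K_poly = polynomial_kernel_matrix(X, a=1.0, b=1.0, p=2)
K_gauss = gaussian_kernel_matrix(X, a=0.5)
# ord=2 gives the spectral norm (largest singular value)
print(np.linalg.norm(K_poly, 2), np.linalg.norm(K_gauss, 2))
```

Since every entry of the Gaussian kernel matrix lies in $(0, 1]$, its spectral norm can never exceed $n$, which the printed value reflects.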

2.2 Background on Subgaussian Random Variables 

Let us start by formally defining subgaussian random variables and vectors. 

Definition 3 (Subgaussian Random Variable and Vector). We call a random variable $x \in \mathbb{R}$ subgaussian if there
exists a constant $C > 0$ such that $\Pr[|x| > t] \le 2\exp(-t^2/C^2)$ for all $t \ge 0$. We say that a random vector $\mathbf{x} \in \mathbb{R}^d$ is
subgaussian if the one-dimensional marginals $\langle \mathbf{x}, \mathbf{y} \rangle$ are subgaussian random variables for all $\mathbf{y} \in \mathbb{R}^d$.

The class of subgaussian random variables includes many random variables that arise naturally in data analysis,
such as standard normal, Bernoulli, spherical, and bounded (where the random variable $x$ satisfies $|x| \le M$
almost surely for some fixed $M$). The natural generalizations of these random variables to higher dimensions are
all subgaussian random vectors. For many isotropic convex sets$^7$ $\mathcal{K}$ (such as the hypercube), a random vector $\mathbf{x}$
uniformly distributed in $\mathcal{K}$ is subgaussian.

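Definition 4 (Norm of Subgaussian Random Variable and Vector). The $\psi_2$-norm of a subgaussian random variable
$x \in \mathbb{R}$, denoted by $\|x\|_{\psi_2}$, is:

$$\|x\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[\exp(|x|^2/t^2)] \le 2\}.$$

The $\psi_2$-norm of a subgaussian random vector $\mathbf{x} \in \mathbb{R}^d$ is:

$$\|\mathbf{x}\|_{\psi_2} = \sup_{\mathbf{y} \in S^{d-1}} \|\langle \mathbf{x}, \mathbf{y} \rangle\|_{\psi_2}.$$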
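As a worked instance of the definition, consider a standard normal variable $x$. For $t^2 > 2$ one has $\mathbb{E}[\exp(x^2/t^2)] = (1 - 2/t^2)^{-1/2}$, which equals 2 exactly when $t^2 = 8/3$, so $\|x\|_{\psi_2} = \sqrt{8/3}$. The sketch below checks this numerically; the grid, the bisection bracket, and the function names are assumptions made for the example.

```python
import numpy as np

def expected_exp_sq(t, grid_half_width=40.0, step=1e-3):
    """Riemann-sum approximation of E[exp(x^2 / t^2)] for x ~ N(0, 1), valid for t^2 > 2."""
    x = np.arange(-grid_half_width, grid_half_width, step)
    # Combine the two exponents before exponentiating to avoid overflow.
    integrand = np.exp(x ** 2 / t ** 2 - 0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(integrand) * step)

def psi2_norm_standard_normal(lo=1.5, hi=3.0, iters=40):
    """Bisection for the smallest t with E[exp(x^2 / t^2)] <= 2."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_exp_sq(mid) <= 2.0:
            hi = mid        # constraint satisfied: try a smaller t
        else:
            lo = mid        # constraint violated: t must be larger
    return hi

t_star = psi2_norm_standard_normal()
print(t_star, np.sqrt(8.0 / 3.0))
```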

Claim 1 (Vershynin [27]). Let $\mathbf{x} \in \mathbb{R}^d$ be a subgaussian random vector. Then there exists a constant $C > 0$
such that $\Pr[|\mathbf{x}| > t] \le 2\exp(-Ct^2/\|\mathbf{x}\|_{\psi_2}^2)$.

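Consider a subset $T$ of $\mathbb{R}^d$, and let $\epsilon > 0$. An $\epsilon$-net of $T$ is a subset $\mathcal{N} \subseteq T$ such that for every $\mathbf{x} \in T$, there
exists a $\mathbf{z} \in \mathcal{N}$ such that $\|\mathbf{x} - \mathbf{z}\| \le \epsilon$. We will use the following well-known result about the size of $\epsilon$-nets.

Proposition 2.1 (Bounding the size of an $\epsilon$-Net [27]). Let $T$ be a subset of $S^{d-1}$ and let $\epsilon > 0$. Then there exists
an $\epsilon$-net of $T$ of cardinality at most $(1 + 2/\epsilon)^d$.

The proof of the following claim follows by standard techniques.

Claim 2 ([?]). Let $\mathcal{N}$ be a $1/2$-net of $S^{d-1}$. Then for any $\mathbf{x} \in \mathbb{R}^d$, $\|\mathbf{x}\| \le 2\max_{\mathbf{y} \in \mathcal{N}} \langle \mathbf{x}, \mathbf{y} \rangle$.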

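$^7$ A convex set $\mathcal{K}$ in $\mathbb{R}^d$ is called isotropic if a random vector chosen uniformly from $\mathcal{K}$ according to the volume is isotropic. A random
vector $\mathbf{x} \in \mathbb{R}^d$ is isotropic if for all $\mathbf{y} \in \mathbb{R}^d$, $\mathbb{E}[\langle \mathbf{x}, \mathbf{y} \rangle^2] = \|\mathbf{y}\|^2$.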
3 Largest Eigenvalue of Random Kernel Matrices


In this section, we provide upper bounds on the largest eigenvalue of a random kernel matrix constructed
using polynomial or Gaussian kernels. Notice that the entries of a random kernel matrix are dependent. For
example, any triplet of entries $(i, j)$, $(j, k)$, and $(k, i)$ are mutually dependent. Additionally, we deal with vectors
drawn from general subgaussian distributions, and therefore the coordinates within a random vector need not be
independent.

We start off with a simple lemma to bound the Euclidean norm of a subgaussian random vector. A random
vector $\mathbf{x}$ is centered if $\mathbb{E}[\mathbf{x}] = 0$.

Lemma 3.1. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be independent centered subgaussian vectors. Then for all $i \in [n]$,
$\Pr[\|\mathbf{x}_i\| > C\sqrt{d}] \le \exp(-C'd)$ for constants $C, C'$.

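Proof. To this end, note that since $\mathbf{x}_i$ is a subgaussian vector (from Definition 3),

$$\Pr\left[|\langle \mathbf{x}_i, \mathbf{y} \rangle| > C\sqrt{d}/2\right] \le 2\exp(-C_2 d),$$

for constants $C$ and $C_2$ and any unit vector $\mathbf{y} \in S^{d-1}$. Taking the union bound over a $(1/2)$-net ($\mathcal{N}$) in $S^{d-1}$ and
using Proposition 2.1 for the size of the nets (which is at most $5^d$ as $\epsilon = 1/2$), we get that

$$\Pr\left[\max_{\mathbf{y} \in \mathcal{N}} |\langle \mathbf{x}_i, \mathbf{y} \rangle| > C\sqrt{d}/2\right] \le \exp(-C_3 d).$$

From Claim 2, we know that $\|\mathbf{x}_i\| \le 2\max_{\mathbf{y} \in \mathcal{N}} \langle \mathbf{x}_i, \mathbf{y} \rangle$. Hence, $\Pr[\|\mathbf{x}_i\| > C\sqrt{d}] \le \exp(-C'd)$. □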
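Lemma 3.1 can be illustrated empirically: for common subgaussian vectors, the ratio $\|\mathbf{x}\|/\sqrt{d}$ stays bounded by a modest constant across many draws. The following sketch uses two example distributions (standard normal and random sign coordinates); the number of trials and the dimensions are arbitrary choices for illustration, and a simulation of this kind is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 2000
max_ratio = {}
for d in (10, 100, 1000):
    gauss = rng.standard_normal((trials, d))             # N(0, 1) coordinates
    signs = rng.choice([-1.0, 1.0], size=(trials, d))    # random sign coordinates
    max_ratio[d] = (
        float((np.linalg.norm(gauss, axis=1) / np.sqrt(d)).max()),
        float((np.linalg.norm(signs, axis=1) / np.sqrt(d)).max()),
    )
    # Largest observed ||x|| / sqrt(d) over all trials, for each distribution
    print(d, max_ratio[d])
```

For sign vectors the ratio is exactly 1, while for Gaussian vectors the largest observed ratio moves closer to 1 as $d$ grows, in line with the $\exp(-C'd)$ tail in the lemma.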


Polynomial Kernel. We now establish the bound on the spectral norm of a polynomial kernel random matrix.
We assume $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are independent vectors drawn according to a centered subgaussian distribution over $\mathbb{R}^d$.
Let $K_p$ denote the kernel matrix obtained using $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in a polynomial kernel. Our idea is to split the kernel
matrix $K_p$ into its diagonal and off-diagonal parts, and then bound the spectral norms of these two matrices
separately. The diagonal part contains independent entries of the form $(a\|\mathbf{x}_i\|^2 + b)^p$, and we use Lemma 3.1 to
bound its spectral norm. Dealing with the off-diagonal part of $K_p$ is trickier because of the dependence between
the entries, and here we bound the spectral norm by its Frobenius norm. We also verify the upper bounds provided
in the following theorem by conducting numerical experiments (see Figure 1(a)).

Theorem 3.2. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be independent centered subgaussian vectors. Let $p \in \mathbb{N}$, and let $K_p$ be the
$n \times n$ matrix with $(i, j)$th entry $K_{p_{ij}} = (a\langle \mathbf{x}_i, \mathbf{x}_j \rangle + b)^p$. Assume that $n \le \exp(C_1 d)$ for a constant $C_1$. Then
there exist constants $C_0, C_0'$ such that

$$\Pr\left[\|K_p\| > C_0^p |a|^p d^p n + 2^{p+1} |b|^p n\right] \le \exp(-C_0' d).$$


Proof. To prove the theorem, we split the kernel matrix $K_p$ into the diagonal and off-diagonal parts. Let $K_p =
D + W$, where $D$ represents the diagonal part of $K_p$ and $W$ the off-diagonal part of $K_p$. Note that

$$\|K_p\| \le \|D\| + \|W\| \le \|D\| + \|W\|_F.$$

Let us estimate the norm of the diagonal part $D$ first. From Lemma 3.1, we know that for all $i \in [n]$, with
$C_3 = C'$,

$$\Pr\left[\|\mathbf{x}_i\| > C\sqrt{d}\right] = \Pr\left[\|\mathbf{x}_i\|^2 > (C\sqrt{d})^2\right] \le \exp(-C_3 d).$$
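Instead of $\|\mathbf{x}_i\|^2$, we are interested in bounding $(a\|\mathbf{x}_i\|^2 + b)^p$:

$$\Pr\left[\|\mathbf{x}_i\|^2 > (C\sqrt{d})^2\right] = \Pr\left[(a\|\mathbf{x}_i\|^2 + b)^p > (a(C\sqrt{d})^2 + b)^p\right]. \quad (1)$$

Consider $(a(C\sqrt{d})^2 + b)^p$. A simple inequality to bound $(a(C\sqrt{d})^2 + b)^p$ is$^8$

$$(a(C\sqrt{d})^2 + b)^p \le 2^p\left(|a|^p (C\sqrt{d})^{2p} + |b|^p\right).$$

Therefore,

$$\Pr\left[(a\|\mathbf{x}_i\|^2 + b)^p > 2^p\left(|a|^p (C\sqrt{d})^{2p} + |b|^p\right)\right] \le \Pr\left[(a\|\mathbf{x}_i\|^2 + b)^p > (a(C\sqrt{d})^2 + b)^p\right].$$

Using (1) and substituting in the above equation, for any $i \in [n]$,

$$\Pr\left[(a\|\mathbf{x}_i\|^2 + b)^p > 2^p\left(|a|^p C^{2p} d^p + |b|^p\right)\right] \le \Pr\left[\|\mathbf{x}_i\| > C\sqrt{d}\right] \le \exp(-C_3 d).$$

By applying a union bound over all $n$ non-zero entries in $D$, we get that

$$\Pr\left[\max_{i \in [n]} (a\|\mathbf{x}_i\|^2 + b)^p > 2^p\left(|a|^p C^{2p} d^p + |b|^p\right)\right] \le n \cdot \exp(-C_3 d) \le \exp(C_1 d) \cdot \exp(-C_3 d) \le \exp(-C_4 d),$$

as we assumed that $n \le \exp(C_1 d)$. This implies that

$$\Pr\left[\|D\| > 2^p\left(|a|^p C^{2p} d^p + |b|^p\right)\right] \le \exp(-C_4 d). \quad (2)$$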

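We now bound the spectral norm of the off-diagonal part $W$ using the Frobenius norm as an upper bound on the
spectral norm. Firstly, note that for any $\mathbf{y} \in \mathbb{R}^d$, the random variable $\langle \mathbf{x}_i, \mathbf{y} \rangle$ is subgaussian with its $\psi_2$-norm at
most $C_5\|\mathbf{y}\|$ for some constant $C_5$. This follows as:

$$\|\langle \mathbf{x}_i, \mathbf{y} \rangle\|_{\psi_2} := \inf\{t > 0 : \mathbb{E}[\exp(\langle \mathbf{x}_i, \mathbf{y} \rangle^2/t^2)] \le 2\} \le C_5\|\mathbf{y}\|.$$

Therefore, for a fixed $\mathbf{x}_j$, $\|\langle \mathbf{x}_i, \mathbf{x}_j \rangle\|_{\psi_2} \le C_5\|\mathbf{x}_j\|$. For $i \ne j$, conditioning on $\mathbf{x}_j$,

$$\Pr\left[|\langle \mathbf{x}_i, \mathbf{x}_j \rangle| > \tau\right] = \mathbb{E}_{\mathbf{x}_j}\left[\Pr\left[|\langle \mathbf{x}_i, \mathbf{x}_j \rangle| > \tau \mid \mathbf{x}_j\right]\right].$$

From Claim 1,

$$\mathbb{E}_{\mathbf{x}_j}\left[\Pr\left[|\langle \mathbf{x}_i, \mathbf{x}_j \rangle| > \tau \mid \mathbf{x}_j\right]\right] \le \mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_6\tau^2}{\|\langle \mathbf{x}_i, \mathbf{x}_j \rangle\|_{\psi_2}^2}\right)\right] \le \mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_6\tau^2}{(C_5\|\mathbf{x}_j\|)^2}\right)\right] = \mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{\|\mathbf{x}_j\|^2}\right)\right],$$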

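where the last inequality uses the fact that $\|\langle \mathbf{x}_i, \mathbf{x}_j \rangle\|_{\psi_2} \le C_5\|\mathbf{x}_j\|$. Now let us condition the above expectation
on the value of $\|\mathbf{x}_j\|$ based on whether $\|\mathbf{x}_j\| > C\sqrt{d}$ or $\|\mathbf{x}_j\| \le C\sqrt{d}$. We can rewrite

$$\mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{\|\mathbf{x}_j\|^2}\right)\right] \le \mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{C^2 d}\right) \,\middle|\, \|\mathbf{x}_j\| \le C\sqrt{d}\right] \Pr\left[\|\mathbf{x}_j\| \le C\sqrt{d}\right] + \mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{\|\mathbf{x}_j\|^2}\right) \,\middle|\, \|\mathbf{x}_j\| > C\sqrt{d}\right] \Pr\left[\|\mathbf{x}_j\| > C\sqrt{d}\right].$$

$^8$ For any $a, b, m \in \mathbb{R}$ and $p \in \mathbb{N}$, $(a \cdot m + b)^p \le 2^p(|a|^p |m|^p + |b|^p)$.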

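The above equation can easily be simplified as:

$$\mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{\|\mathbf{x}_j\|^2}\right)\right] \le \exp\left(\frac{-C_8\tau^2}{d}\right) + \mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{\|\mathbf{x}_j\|^2}\right) \,\middle|\, \|\mathbf{x}_j\| > C\sqrt{d}\right] \Pr\left[\|\mathbf{x}_j\| > C\sqrt{d}\right].$$

From Lemma 3.1, $\Pr[\|\mathbf{x}_j\| > C\sqrt{d}] \le \exp(-C_3 d)$, and

$$\mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{\|\mathbf{x}_j\|^2}\right) \,\middle|\, \|\mathbf{x}_j\| > C\sqrt{d}\right] \le 1.$$

This implies that (as $\Pr[\|\mathbf{x}_j\| > C\sqrt{d}] \le \exp(-C_3 d)$),

$$\mathbb{E}_{\mathbf{x}_j}\left[\exp\left(\frac{-C_7\tau^2}{\|\mathbf{x}_j\|^2}\right) \,\middle|\, \|\mathbf{x}_j\| > C\sqrt{d}\right] \Pr\left[\|\mathbf{x}_j\| > C\sqrt{d}\right] \le \exp(-C_3 d).$$

Putting the above arguments together,

$$\Pr\left[|\langle \mathbf{x}_i, \mathbf{x}_j \rangle| > \tau\right] = \mathbb{E}_{\mathbf{x}_j}\left[\Pr\left[|\langle \mathbf{x}_i, \mathbf{x}_j \rangle| > \tau \mid \mathbf{x}_j\right]\right] \le \exp\left(\frac{-C_8\tau^2}{d}\right) + \exp(-C_3 d).$$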

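Taking a union bound over all $(n^2 - n) < n^2$ non-zero entries in $W$,

$$\Pr\left[\max_{i \ne j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| > \tau\right] \le n^2\left(\exp\left(\frac{-C_8\tau^2}{d}\right) + \exp(-C_3 d)\right).$$

Setting $\tau = C \cdot d$ in the above and using the fact that $n \le \exp(C_1 d)$,

$$\Pr\left[\max_{i \ne j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| > C \cdot d\right] \le \exp(-C_9 d). \quad (3)$$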


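We are now ready to bound the Frobenius norm of $W$:

$$\|W\|_F = \Big(\sum_{i \ne j} (a\langle \mathbf{x}_i, \mathbf{x}_j \rangle + b)^{2p}\Big)^{1/2} \le \Big(n^2 2^{2p}\big(|a|^{2p} \max_{i \ne j} \langle \mathbf{x}_i, \mathbf{x}_j \rangle^{2p} + |b|^{2p}\big)\Big)^{1/2} \le n 2^p\big(|a|^p \max_{i \ne j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle|^p + |b|^p\big).$$

Plugging in the probabilistic bound on $\max_{i \ne j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle|$ from (3) gives

$$\Pr\left[\|W\|_F > n 2^p\left(|a|^p C^p d^p + |b|^p\right)\right] \le \Pr\left[n 2^p\big(|a|^p \max_{i \ne j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle|^p + |b|^p\big) > n 2^p\left(|a|^p C^p d^p + |b|^p\right)\right] \le \exp(-C_9 d). \quad (4)$$

Plugging the bounds on $\|D\|$ (from (2)) and $\|W\|_F$ (from (4)) into the upper bound $\|K_p\| \le \|D\| + \|W\|_F$ yields that
there exist constants $C_0$ and $C_0'$ such that

$$\Pr\left[\|K_p\| > C_0^p |a|^p d^p n + 2^{p+1} |b|^p n\right] \le \Pr\left[\|D\| + \|W\|_F > C_0^p |a|^p d^p n + 2^{p+1} |b|^p n\right] \le \exp(-C_0' d).$$

This completes the proof of the theorem. The chain of constants can easily be estimated starting with the constant
in the definition of the subgaussian random variable. □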
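The deterministic step of the proof, $\|K_p\| \le \|D\| + \|W\|_F$, is easy to check numerically. The sketch below does so for one random draw; the sample sizes, kernel parameters, and the choice of standard normal inputs are illustrative assumptions, and the printed ratio against $d^p n$ is shown only to give a feel for the scaling in Theorem 3.2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, a, b, p = 200, 50, 1.0, 1.0, 2        # illustrative values only
X = rng.standard_normal((n, d))             # rows are centered subgaussian vectors
K = (a * (X @ X.T) + b) ** p                # polynomial kernel matrix K_p

D = np.diag(np.diag(K))                     # diagonal part
W = K - D                                   # off-diagonal part

spec_K = np.linalg.norm(K, 2)               # spectral norm of K_p
bound = np.linalg.norm(D, 2) + np.linalg.norm(W, "fro")
print(spec_K, bound, spec_K / (d ** p * n))
```

The inequality holds for every realization because it only uses the triangle inequality and the fact that the spectral norm is at most the Frobenius norm; the probabilistic part of the proof is what controls the size of `bound`.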

Remark: Note that for our proofs it is only necessary that xi,... ,Xn are independent random vectors, but 
they need not be identically distributed. This spectral norm upper bound on Kp (again with exponentially high 
probability) could be improved to 

O (cP\a\P{dP + dP^^n) + 2P+\\b\p'^ , 

with a slightly more involved analysis (omitted in this extended abstract). For an even p, the expectation of every 
individual entry of the matrix Kp is positive, which provides tight examples for this bound. 

























Gaussian Kernel. We now establish the bound on the spectral norm of a Gaussian kernel random matrix. Again
assume $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are independent vectors drawn according to a centered subgaussian distribution over $\mathbb{R}^d$. Let
$K_g$ denote the kernel matrix obtained using $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in a Gaussian kernel. Here an upper bound of $n$ on the
spectral norm of the kernel matrix follows trivially, as all entries of $K_g$ are at most 1. We show that this
bound is tight, in that for small values of $a$, with high probability the spectral norm is at least $\Omega(n)$.

In fact, it is impossible to obtain a better than $O(n)$ upper bound on the spectral norm of $K_g$ without additional
assumptions on the subgaussian distribution, as illustrated by this example: Consider a distribution over $\mathbb{R}^d$
such that a random vector drawn from this distribution is the zero vector $(0)^d$ with probability $1/2$ and uniformly
distributed over the sphere in $\mathbb{R}^d$ of radius $2\sqrt{d}$ with probability $1/2$. A random vector $\mathbf{x}$ drawn from this
distribution is isotropic and subgaussian, but $\Pr[\mathbf{x} = (0)^d] = 1/2$. Therefore, in $\mathbf{x}_1, \ldots, \mathbf{x}_n$ drawn from this
distribution, with high probability more than a constant fraction of the vectors will be $(0)^d$. This means that a
proportional number of entries of the matrix $K_g$ will be 1, and the norm will be of order $n$ regardless of $a$.
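The example above can be simulated directly. In the sketch below, the sampler name `sample_mixture`, the sizes, and the value of $a$ are assumptions made for illustration; the radius follows the value stated in the text. Because all zero vectors coincide, the kernel matrix contains an all-ones block whose side equals the number of zero vectors, which forces the spectral norm to be at least that number.

```python
import numpy as np

def sample_mixture(n, d, rng):
    """Rows are the zero vector w.p. 1/2, else uniform on the sphere of radius 2*sqrt(d)."""
    g = rng.standard_normal((n, d))
    points = 2.0 * np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)
    is_zero = rng.random(n) < 0.5
    points[is_zero] = 0.0
    return points, is_zero

rng = np.random.default_rng(3)
n, d, a = 400, 30, 1.0                      # illustrative values only
X, is_zero = sample_mixture(n, d, rng)
sq = np.sum(X ** 2, axis=1)
sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
K = np.exp(-a * sq_dists)                   # Gaussian kernel matrix K_g
num_zero = int(is_zero.sum())
spec_K = np.linalg.norm(K, 2)
print(num_zero, spec_K)
```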

This situation changes, however, when we add the additional assumption that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ have independent
centered subgaussian coordinates$^9$ (i.e., each $\mathbf{x}_i$ is drawn from a product distribution formed from some $d$ centered
univariate subgaussian distributions). In that case, the kernel matrix $K_g$ is a small perturbation of the identity
matrix, and we show that the spectral norm of $K_g$ is with high probability bounded by an absolute constant (for
$a = \Omega(\log n/d)$). For this proof, similar to Theorem 3.2, we split the kernel matrix into its diagonal and
off-diagonal parts. The spectral norm of the off-diagonal part is again bounded by its Frobenius norm. We also verify
the upper bounds presented in the following theorem by conducting numerical experiments (see Figure 1(b)).

Theorem 3.3. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be independent centered subgaussian vectors. Let $a > 0$, and let $K_g$ be the
$n \times n$ matrix with $(i, j)$th entry $K_{g_{ij}} = \exp(-a\|\mathbf{x}_i - \mathbf{x}_j\|^2)$. Then there exist constants $c, c_0, c_0', c_1$ such that

a) $\|K_g\| \le n$.

b) If $a < c_1/d$, $\Pr[\|K_g\| > c_0 n] \ge 1 - \exp(-c_0' n)$.

c) If all the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ satisfy the additional assumption of having independent centered subgaussian
coordinates, and $n \le \exp(C_1 d)$ for a constant $C_1$, then for any $\delta > 0$ and $a > (2 + \delta)\frac{\log n}{d}$,
$\Pr[\|K_g\| > 2] \le \exp(-c\zeta^2 d)$ with $\zeta > 0$ depending only on $\delta$.

Proof. The proof of Part a) is straightforward, as all entries of $K_g$ do not exceed 1.

Let us prove the lower estimate for the norm in Part b). For $i = 1, \ldots, n$ define

$$Z_i = \sum_{j = n/2 + 1}^{n} K_{g_{ij}}.$$

From Lemma 3.1, for all $i \in [n]$, $\Pr[\|\mathbf{x}_i\| > C\sqrt{d}] \le \exp(-C'd)$. In other words, $\|\mathbf{x}_i\|$ is less than $C\sqrt{d}$ for
all $i \in [n]$ with probability at least $1 - \exp(-C''d)$. Let us call this event $\mathcal{E}_1$. Under $\mathcal{E}_1$ and the assumption $a < c_1/d$,
$\mathbb{E}[Z_i] \ge c_2 n$ and $\mathbb{E}[Z_i^2] \le c_3 n^2$. Therefore, by the Paley-Zygmund inequality (under event $\mathcal{E}_1$),

$$\Pr[Z_i > c_4 n] \ge c_5. \quad (5)$$

Now $Z_1, \ldots, Z_n$ are not independent random variables. But if we condition on $\mathbf{x}_{n/2+1}, \ldots, \mathbf{x}_n$, then $Z_1, \ldots, Z_{n/2}$
become independent (for simplicity, assume that $n$ is divisible by 2). Thereafter, an application of the Chernoff bound
on $Z_1, \ldots, Z_{n/2}$ using the probability bound from (5) (under conditioning on $\mathbf{x}_{n/2+1}, \ldots, \mathbf{x}_n$ and event $\mathcal{E}_1$) gives:

$$\Pr\left[Z_i > c_4 n \text{ for at least } c_5 n \text{ entries } Z_i \in \{Z_1, \ldots, Z_{n/2}\}\right] \ge 1 - \exp(-c_6 n).$$

$^9$ Some of the commonly used subgaussian random vectors, such as the standard normal and Bernoulli, satisfy this additional assumption.
The first conditioning can be removed by taking the expectation with respect to $\mathbf{x}_{n/2+1}, \ldots, \mathbf{x}_n$ without disturbing
the exponential probability bound. Similarly, conditioning on event $\mathcal{E}_1$ can also be easily removed.

Let $K_g'$ be the submatrix of $K_g$ consisting of rows $1 \le i \le n/2$ and columns $n/2 + 1 \le j \le n$. Note that

\\ISg\\ > vJ K'gVi, where u = ■ ■ ■ i dimension n/2). Then 


< Con] < Pr[||iT'|| < cyn] < Pr[u'''iT'u < cyn] 



2 = 1 


The last line follows since, by the above arguments, with exponentially high probability more than $\Omega(n)$ of the entries
$Z_1, \ldots, Z_{n/2}$ are greater than $\Omega(n)$, and by readjusting the constants.

Proof of the third part: As in Theorem 3.2, we split the matrix $K_g$ into the diagonal ($D$) and the off-diagonal part ($W$) (i.e., $K_g = D + W$). It is simple to observe that $D = I_n$, therefore we just concentrate on $W$. The $(i,j)$th entry in $W$ is $\exp(-a\|x_i - x_j\|^2)$, where $x_i$ and $x_j$ are independent vectors with independent centered subgaussian coordinates. Therefore, we can use Hoeffding's inequality for fixed $i, j$:


\[ \Pr\big[\exp(-a\|x_i - x_j\|^2) \ge \exp(-a(1-\zeta)d)\big] = \Pr\big[\|x_i - x_j\|^2 \le (1-\zeta)d\big] \le \exp(-c_8\zeta^2 d), \tag{6} \]


where we used the fact that if a random variable is subgaussian then its square is a subexponential random variable [27]. To estimate the norm of $W$, we bound it by its Frobenius norm. If $a > (2+\delta)\frac{\log n}{d}$, then we can choose $\zeta > 0$ depending on $\delta$ such that $n^2\exp(-a(1-\zeta)d) < 1$. Hence,


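\[
\begin{aligned}
\Pr[\|K_g\| \ge 2] &\le \Pr[\|D\| + \|W\|_F \ge 2] = \Pr[\|W\|_F \ge 1] \\
&= \Pr\Big[\sum_{1 \le i,j \le n,\, i \ne j} \exp(-a\|x_i - x_j\|^2)^2 \ge 1\Big] \\
&\le \Pr\Big[\sum_{1 \le i,j \le n,\, i \ne j} \exp(-a\|x_i - x_j\|^2) \ge n^2\exp(-a(1-\zeta)d)\Big] \\
&\le \Pr\Big[\max_{1 \le i,j \le n,\, i \ne j} \exp(-a\|x_i - x_j\|^2) \ge \exp(-a(1-\zeta)d)\Big] \\
&\le n^2\exp(-c_8\zeta^2 d) \\
&\le \exp(-c\zeta^2 d) \quad \text{for some constant } c.
\end{aligned}
\]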


The first equality follows as $\|D\| = 1$, and the second-last inequality follows from (6) together with a union bound over the pairs $(i, j)$. This completes the proof of the theorem. Again, the long chain of constants can easily be estimated starting with the constant in the definition of the subgaussian random variable. □
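
The choice of $\zeta$ used in this proof can be made explicit. The following is a short worked calculation, assuming $a \ge (2+\delta)\frac{\log n}{d}$, $0 < \zeta < 1$, and $n \ge 2$; the specific value of $\zeta$ in the last line is one admissible choice for illustration, not necessarily the one the authors had in mind.

```latex
\begin{align*}
n^2 \exp\bigl(-a(1-\zeta)d\bigr)
  &\le n^2 \exp\bigl(-(2+\delta)(1-\zeta)\log n\bigr)
   = n^{\,2-(2+\delta)(1-\zeta)}, \\
2-(2+\delta)(1-\zeta) < 0
  &\iff \zeta < \frac{\delta}{2+\delta}, \\
\zeta = \frac{\delta}{2(2+\delta)}
  &\implies n^2 \exp\bigl(-a(1-\zeta)d\bigr) \le n^{-\delta/2} < 1 .
\end{align*}
```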


Remark: Note that again the $x_i$'s need not be identically distributed. Also, as mentioned earlier, the analysis in Theorem 3.3 could easily be extended to other exponential kernels such as the Laplacian kernel.


Footnote: We call a random variable $x \in \mathbb{R}$ subexponential if there exists a constant $C > 0$ such that $\Pr[|x| > t] \le 2\exp(-t/C)$ for all $t > 0$.
[Figure 1 panels: (a) Polynomial Kernel, (b) Gaussian Kernel]

Figure 1: Largest eigenvalue distribution for random kernel matrices constructed with a polynomial kernel (left plot) and a Gaussian kernel (right plot). The actual value plots are constructed by averaging over 100 runs, and in each run we draw $n$ independent standard Gaussian vectors in $d = 100$ dimensions. The predicted values are computed from the bounds in Theorems 3.2 and 3.3 (third part). The kernel matrix size $n$ is varied from 10 to 10000 in multiples of 10. For the polynomial kernel, we set $a = 1$, $b = 1$, and $p = 4$, and for the Gaussian kernel $a = 3\log(n)/d$. Note that our upper bounds are fairly close to the actual results. For the Gaussian kernel, the actual values are very close to 1.
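
The "actual value" curves of such an experiment can be approximated with a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the polynomial kernel is assumed to have the form $(a\langle x, y\rangle + b)^p$, the number of runs is reduced from 100 to keep the example fast, and all function names are hypothetical. The predicted values are not reproduced here because they depend on the constants in Theorems 3.2 and 3.3.

```python
import numpy as np

def polynomial_kernel_matrix(X, a=1.0, b=1.0, p=4):
    """Assumed form of the polynomial kernel: (a * <x_i, x_j> + b) ** p."""
    return (a * (X @ X.T) + b) ** p

def gaussian_kernel_matrix(X, a):
    """K[i, j] = exp(-a * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-a * np.maximum(sq_dists, 0.0))

def largest_eigenvalue(K):
    # K is symmetric, so eigvalsh applies; it returns eigenvalues in ascending order.
    return np.linalg.eigvalsh(K)[-1]

def average_top_eigenvalue(n, d=100, runs=5, seed=0):
    """Average largest eigenvalue over `runs` draws of n standard Gaussian vectors."""
    rng = np.random.default_rng(seed)
    poly, gauss = [], []
    for _ in range(runs):
        X = rng.standard_normal((n, d))
        poly.append(largest_eigenvalue(polynomial_kernel_matrix(X)))
        gauss.append(largest_eigenvalue(gaussian_kernel_matrix(X, 3.0 * np.log(n) / d)))
    return float(np.mean(poly)), float(np.mean(gauss))

for n in (10, 100):
    print(n, average_top_eigenvalue(n))
```

Larger values of $n$ can be added to the loop at the cost of an eigenvalue computation that scales roughly cubically in $n$.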


4 Application: Privately Releasing Kernel Ridge Regression Coefficients 

We consider an application of Theorems 3.2 and 3.3 to obtain noise lower bounds for privately releasing coefficients of kernel ridge regression. For privacy violation, we consider a generalization of blatant non-privacy |21 referred to as attribute non-privacy (formalized in [15]). Consider a database $D \in \mathbb{R}^{n \times (d+1)}$ that contains, for each individual $i$, a sensitive attribute $y_i \in \{0,1\}$ as well as some other information $x_i \in \mathbb{R}^d$ which is assumed to be known to the attacker. The $i$th record is thus $(x_i, y_i)$. Let $X \in \mathbb{R}^{n \times d}$ be a matrix whose $i$th row is $x_i$, and let $y = (y_1, \dots, y_n)$. We denote the entire database $D = (X|y)$ where $|$ represents vertical concatenation. Given some released information $\rho$, the attacker constructs an estimate $\hat{y}$ that she hopes is close to $y$. We measure the attack's success in terms of the Hamming distance $d_H(y, \hat{y})$. A scheme is not attribute private if an attacker can consistently get an estimate that is within distance $o(n)$. Formally:

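Definition 5 (Failure of Attribute Privacy [15]). A (randomized) mechanism $\mathcal{M}$, which maps a database $D = (X|y)$ to released information $\rho$, is said to allow $(\theta, \gamma)$ attribute reconstruction if there exists a setting of the nonsensitive attributes $X \in \mathbb{R}^{n \times d}$ and an algorithm (adversary) $\mathcal{A}$, which maps $(X, \rho)$ to an estimate $\hat{y}$, such that for every $y \in \{0,1\}^n$,

\[ \Pr_{\rho \leftarrow \mathcal{M}((X|y))}\big[\mathcal{A}(X, \rho) = \hat{y} : d_H(y, \hat{y}) \le \theta\big] \ge 1 - \gamma. \]

Asymptotically, we say that a mechanism is attribute nonprivate if there is an infinite sequence of $n$ for which $\mathcal{M}$ allows $(o(n), o(1))$-reconstruction. Here $d = d(n)$ is a function of $n$. We say the attack $\mathcal{A}$ is efficient if it runs in time $\mathrm{poly}(n, d)$.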

Kernel Ridge Regression Background. One of the most basic regression formulations is that of ridge regression [10]. Suppose that we are given a dataset $\{(x_i, y_i)\}_{i=1}^n$ consisting of $n$ points with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Here the $x_i$'s are referred to as the regressors and the $y_i$'s are the response variables. In linear regression the task is to find a linear function that models the dependencies between the $x_i$'s and the $y_i$'s. A common way to prevent overfitting in linear regression is by adding a penalty regularization term (also known as shrinkage in statistics). In kernel ridge regression [21], we assume a model of the form $y = f(x) + \xi$, where we are trying to estimate the regression function $f$ and $\xi$ is some unknown vector that accounts for the discrepancy between the actual response ($y$) and the predicted outcome ($f(x)$). Given a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $\kappa$, the goal of kernel ridge regression is to estimate the unknown function $f^*$ such that the least-squares loss defined over the dataset, with a weighted penalty based on the squared Hilbert norm, is minimized.

\[ \text{Kernel Ridge Regression:} \quad \operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda\|f\|_{\mathcal{H}}^2, \tag{7} \]

where $\lambda > 0$ is a regularization parameter. By the representer theorem [22], any solution $f^*$ for (7) takes the form

\[ f^*(\cdot) = \sum_{i=1}^{n} \alpha_i \kappa(\cdot, x_i), \tag{8} \]

where $\alpha = (\alpha_1, \dots, \alpha_n)$ is known as the kernel ridge regression coefficient vector. Plugging this representation into (7) and solving the resulting optimization problem (in terms of $\alpha$ now), we get that the minimum value is achieved for $\alpha = \alpha^*$, where

\[ \alpha^* = (K + \lambda I_n)^{-1} y, \text{ where } K \text{ is the kernel matrix with } K_{ij} = \kappa(x_i, x_j) \text{ and } y = (y_1, \dots, y_n). \tag{9} \]

Plugging this $\alpha^*$ from (9) into (8) gives the final form for the estimate $f^*(\cdot)$. This means that for a new point $x \in \mathbb{R}^d$ the predicted response is $f^*(x) = \sum_{i=1}^{n} \alpha_i^* \kappa(x, x_i)$, where $\alpha^* = (K + \lambda I_n)^{-1} y$ and $\alpha^* = (\alpha_1^*, \dots, \alpha_n^*)$. Therefore, knowledge of $\alpha^*$ and $x_1, \dots, x_n$ suffices for using the regression model for making future predictions.

If $K$ is constructed using a polynomial kernel (as defined earlier) then the above procedure is referred to as polynomial kernel ridge regression, and similarly if $K$ is constructed using a Gaussian kernel (as defined earlier) then the above procedure is referred to as Gaussian kernel ridge regression.
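
The closed form for the coefficient vector and the resulting predictor translate directly into code. The following is a minimal sketch for the Gaussian kernel, assuming toy data and arbitrary values for the regularization parameter and kernel width; the function names `gaussian_kernel`, `krr_fit`, and `krr_predict` are hypothetical and do not come from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, a):
    """Matrix of exp(-a * ||A_i - B_j||^2) for rows A_i of A and B_j of B."""
    sq = (np.sum(A * A, axis=1)[:, None]
          + np.sum(B * B, axis=1)[None, :]
          - 2.0 * (A @ B.T))
    return np.exp(-a * np.maximum(sq, 0.0))

def krr_fit(K, y, lam):
    """Coefficient vector (K + lam * I_n)^{-1} y, computed by a linear solve."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(X_train, alpha, X_new, a):
    """Evaluate sum_i alpha_i * kappa(x, x_i) for each row x of X_new."""
    return gaussian_kernel(X_new, X_train, a) @ alpha

rng = np.random.default_rng(1)          # seed chosen arbitrarily
n, d, lam, a = 50, 5, 0.5, 0.1          # illustrative parameters
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)  # toy responses

K = gaussian_kernel(X, X, a)
alpha = krr_fit(K, y, lam)
fitted = krr_predict(X, alpha, X, a)    # predictions at the training points

# Residual of the defining linear system; it should be at round-off level.
print(np.max(np.abs((K + lam * np.eye(n)) @ alpha - y)))
```

A linear solve is used instead of forming the inverse explicitly, which is the usual numerically safer choice; the two are mathematically equivalent here because $K + \lambda I_n$ is invertible for $\lambda > 0$.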

Reconstruction Attack from Noisy $\alpha^*$. Algorithm 1 outlines the attack. The privacy mechanism releases a noisy approximation to $\alpha^*$. Let $\tilde{\alpha}$ be this noisy approximation, i.e., $\tilde{\alpha} = \alpha^* + e$ where $e$ is some unknown noise vector. The adversary tries to reconstruct an approximation $\hat{y}$ of $y$ from $\tilde{\alpha}$. The adversary solves the following $\ell_2$-minimization problem to construct $\hat{y}$:

\[ \min_{z \in \mathbb{R}^n} \|\tilde{\alpha} - (K + \lambda I_n)^{-1} z\|. \tag{10} \]

In the setting of attribute privacy, the database $D = (X|y)$. Let $x_1, \dots, x_n$ be the rows of $X$, using which the adversary can construct $K$ to carry out the attack. Since the matrix $K + \lambda I_n$ is invertible for $\lambda > 0$, as $K$ is a positive semidefinite matrix, the solution to (10) is simply $z = (K + \lambda I_n)\tilde{\alpha}$, element-wise rounding of which to the closest of 0, 1 gives $\hat{y}$.

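Lemma 4.1. Let $\tilde{\alpha} = \alpha^* + e$, where $e \in \mathbb{R}^n$ is some unknown (noise) vector. If $\|e\|_\infty < \beta$ (absolute value of all entries in $e$ is less than $\beta$), then $\hat{y}$ returned by Algorithm 1 satisfies $d_H(y, \hat{y}) \le 4(\|K\| + \lambda)^2\beta^2 n$. In particular, if $\beta = o\left(\frac{1}{\|K\| + \lambda}\right)$ then $d_H(y, \hat{y}) = o(n)$.

Proof. Since $\alpha^* = (K + \lambda I_n)^{-1} y$, $\tilde{\alpha} = (K + \lambda I_n)^{-1} y + e$. Now multiplying by $(K + \lambda I_n)$ on both sides gives

\[ (K + \lambda I_n)\tilde{\alpha} = y + (K + \lambda I_n)e. \]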
Algorithm 1 Reconstruction Attack from Noisy Kernel Ridge Regression Coefficients

Input: Public information $X \in \mathbb{R}^{n \times d}$, regularization parameter $\lambda$, and $\tilde{\alpha}$ (noisy version of $\alpha^*$ defined in (9)).

1: Let $x_1, \dots, x_n$ be the rows of $X$; construct the kernel matrix $K$ with $K_{ij} = \kappa(x_i, x_j)$
2: Return $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)$ defined as follows:

\[ \hat{y}_i = \begin{cases} 0 & \text{if the } i\text{th entry in } (K + \lambda I_n)\tilde{\alpha} < 1/2 \\ 1 & \text{otherwise} \end{cases} \]


Concentrate on $\|(K + \lambda I_n)e\|$. This can be bounded as

\[ \|(K + \lambda I_n)e\| \le \|K + \lambda I_n\|\,\|e\| = (\|K\| + \lambda)\|e\|. \]

If the absolute values of all the entries in $e$ are less than $\beta$, then $\|e\| \le \beta\sqrt{n}$. A simple manipulation then shows that if the above holds, then $(K + \lambda I_n)e$ cannot have more than $4(\|K\| + \lambda)^2\beta^2 n$ entries with absolute value at least 1/2. Since $y$ and $\hat{y}$ only differ in those entries where $(K + \lambda I_n)e$ is at least 1/2 in absolute value, it follows that $d_H(y, \hat{y}) \le 4(\|K\| + \lambda)^2\beta^2 n$. Setting $\beta = o\left(\frac{1}{\|K\| + \lambda}\right)$ implies $d_H(y, \hat{y}) = o(n)$. □
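
The attack of Algorithm 1 and the Hamming-distance bound of Lemma 4.1 can be exercised on synthetic data. This is a minimal sketch under assumptions that are not from the paper: a Gaussian kernel with $a = 3\log(n)/d$, standard Gaussian public attributes, uniformly distributed noise bounded by $\beta$, and arbitrary values for $n$, $d$, $\lambda$, and $\beta$. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(X, a):
    """K[i, j] = exp(-a * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-a * np.maximum(sq_dists, 0.0))

def reconstruction_attack(K, alpha_noisy, lam):
    """Round the entries of (K + lam * I_n) @ alpha_noisy to {0, 1}."""
    z = K @ alpha_noisy + lam * alpha_noisy   # avoids forming K + lam * I explicitly
    return (z >= 0.5).astype(int)

rng = np.random.default_rng(2)               # seed chosen arbitrarily
n, d, lam, beta = 300, 100, 1.0, 0.05        # illustrative parameters
a = 3.0 * np.log(n) / d
X = rng.standard_normal((n, d))              # public (non-sensitive) attributes
y = rng.integers(0, 2, size=n)               # sensitive bits

K = gaussian_kernel_matrix(X, a)
alpha_star = np.linalg.solve(K + lam * np.eye(n), y)   # exact coefficients
e = rng.uniform(-beta, beta, size=n)                   # noise bounded by beta
y_hat = reconstruction_attack(K, alpha_star + e, lam)

hamming = int(np.sum(y_hat != y))
bound = 4.0 * (np.linalg.norm(K, 2) + lam) ** 2 * beta ** 2 * n
print(hamming, bound)
```

The printed Hamming distance should not exceed the printed bound; increasing `beta` makes the bound looser and eventually lets reconstruction errors appear.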

For a privacy mechanism to be attribute non-private, the adversary has to be able to reconstruct a $1 - o(1)$ fraction of $y$ with high probability. Using the above lemma, and the different bounds on $\|K\|$ established in Theorems 3.2 and 3.3, we get the following lower bounds for privately releasing kernel ridge regression coefficients.

Proposition 4.2. 1) Any privacy mechanism which for every database $D = (X|y)$ where $X \in \mathbb{R}^{n \times d}$ and $y \in \{0,1\}^n$ releases the coefficient vector of a polynomial kernel ridge regression model (for constants $a$, $b$, and $p$) fitted between $X$ (matrix of regressor values) and $y$ (response vector), by adding noise to each coordinate is attribute non-private. The attack that achieves this attribute privacy violation operates in $O(dn^2)$ time.

2) Any privacy mechanism which for every database $D = (X|y)$ where $X \in \mathbb{R}^{n \times d}$ and $y \in \{0,1\}^n$ releases the coefficient vector of a Gaussian kernel ridge regression model (for constant $a$) fitted between $X$ (matrix of regressor values) and $y$ (response vector), by adding $o\left(\frac{1}{2 + \lambda}\right)$ noise to each coordinate is attribute non-private. The attack that achieves this attribute privacy violation operates in $O(dn^2)$ time.


Proof. For Part 1, draw each individual $i$'s non-sensitive attribute vector $x_i$ independently from any $d$-dimensional subgaussian distribution, and use Lemma 4.1 in conjunction with Theorem 3.2.

For Part 2, draw each individual $i$'s non-sensitive attribute vector $x_i$ independently from any product distribution formed from some $d$ centered univariate subgaussian distributions, and use Lemma 4.1 in conjunction with Theorem 3.3 (third part; see footnote 7).

The time needed to construct the kernel matrix $K$ is $O(dn^2)$, which dominates the overall computation time. □


We can ask how the above distortion needed for privacy compares to typical entries in $\alpha^*$. The answer is not simple, but there are natural settings of inputs where the noise needed for privacy becomes comparable with the coordinates of $\alpha^*$, implying that the privacy comes at a steep price. One such example is if the $x_i$'s are drawn from the standard normal distribution, $y = (1)^n$, and all other kernel parameters are constant; then the expected value of the corresponding $\alpha^*$ coordinates matches the noise bounds obtained in Proposition 4.2.

Footnote 7: Note that it is not critical for the $x_i$'s to be drawn from a product distribution. It is possible to analyze the attack even under a (weaker) assumption that each individual $i$'s non-sensitive attribute vector $x_i$ is drawn independently from a $d$-dimensional subgaussian distribution, by using Lemma 4.1 in conjunction with Theorem 3.3 (first part).

Note that Proposition 4.2 makes no assumptions on the dimension $d$ of the data, and holds for all values of $n, d$. This is different from all other previous lower bounds for attribute privacy [15, 4, 14], all of which require $d$ to be comparable to $n$, thereby holding only either when the non-sensitive data (the $x_i$'s) are very high-dimensional or for very small $n$. Also, all the previous lower bound analyses [15, 4, 14] critically rely on the fact that the individual coordinates of each of the $x_i$'s are independent, which is not essential for Proposition 4.2.

Note on $\ell_1$-reconstruction Attacks. A natural alternative to (10) is to use $\ell_1$-minimization (also known as "LP decoding"). This gives rise to the following linear program:

\[ \min_{z \in \mathbb{R}^n} \|\tilde{\alpha} - (K + \lambda I_n)^{-1} z\|_1. \tag{11} \]

In the context of privacy, the $\ell_1$-minimization approach was first proposed by Dwork et al. lUl, and recently reanalyzed in different contexts by [4, 14]. These results have shown that, for some settings, the $\ell_1$-minimization can handle considerably more complex noise patterns than the $\ell_2$-minimization. However, in our setting, since the solutions for (11) and (10) are exactly the same ($z = (K + \lambda I_n)\tilde{\alpha}$), there is no inherent advantage of using the $\ell_1$-minimization.
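
The reason both programs share a minimizer is that $z = (K + \lambda I_n)\tilde{\alpha}$ drives the residual to zero, so it attains the minimum of any norm. The following is a minimal numerical sketch of that observation; the random positive semidefinite matrix stands in for a kernel matrix and the random vector stands in for the released coefficients, both being assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)            # seed chosen arbitrarily
n, lam = 40, 0.5                          # illustrative size and regularization
B = rng.standard_normal((n, n))
K = B @ B.T / n                           # arbitrary PSD stand-in for a kernel matrix
alpha_noisy = rng.standard_normal(n)      # stand-in for the released noisy coefficients

M = K + lam * np.eye(n)                   # invertible because K is PSD and lam > 0
z = M @ alpha_noisy                       # candidate minimizer of both objectives
residual = alpha_noisy - np.linalg.solve(M, z)

# Both objective values should be zero up to floating-point round-off.
print(np.linalg.norm(residual, 1), np.linalg.norm(residual, 2))
```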